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ABSTRACT 


Availability  quantifies  the  propensity  of  a  system  to  be  functionally 
operative  upon  demand.  It  increases  if  operating  times  between  failures 
(“up  times”)  are  long,  and  decreases  if,  following  failure  or  anticipatory 
removal,  logistics  delays  and  repair  (“down  times”)  are  protracted.  This 
chapter  summarizes  the  general  availability  concept  and  discusses  the 
limitation  of  operational  availability  suggesting  that  mission  availability  is 
often  more  useful  and  appropriate. 


SYSTEM  AVAILABILITY 


1.  OVERVIEW 

All  mechanical,  electrical,  nuclear  power  generation,  propulsion,  weapons 
systems,  computer  and  data  storage  systems  (both  hard  and  software),  communications, 
biological/medical,  and  many  combinations  thereof  that  are  designed,  manufactured,  and 
tested,  operated  and  maintained  with  human  skills,  engineering,  and  scientific  knowledge, 
are  subject  to  failure.  This  means  that  they  become  unable  to  perform  the  intended  design 
function,  or  mission.  Reliability  is  a  measure  of  the  propensity  of  component,  subsystem, 
or  system  (of  systems)  to  satisfactorily  perfonn  a  required  designed  performance  function 
on,  say,  a  military  or  homeland  security  mission.  This  includes  effectively  and  promptly 
responding  to  a  natural  disaster  such  as  a  hurricane  (Katrina),  tornado,  or  earthquake. 
During  the  period  of  inability  to  perfonn,  the  item  is  said  to  be  not  available  or 
unavailable.  Often  this  propensity  is  quantified  as  a  (conditional)  probability,  where  the 
conditions  include  stressful  aspects  of  the  operational  environment  and  usage.  Note  that  it 
is  not  customary  to  include  vulnerability  to  opponent  action  in  measures  of  military  or 
security  system  reliability.  Many,  even  most,  failed  systems  are  subject  to  failure 
conection  or  rectification,  which  often  requires  a  non-negligible  total  down  time,  during 
which  time  interval  they  are  completely  or  partially  unavailable.  Thus  an  initial  informal 
definition  of  availability  can  be  an  estimated  probability  that,  at  the  moment  of 
demand/need  for  a  system’s  performance,  it  will  furnish  that  service,  and  not  be  in  the 
process  of  awaiting  for  or  being  rectified/repaired/replaced.  Note  that  field  support  of 
system  availability  and  capability  to  accomplish  a  mission  profile  depends  upon 
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sustainment  resources :  personnel,  logistics/spare  parts,  tow  vehicles,  etc.  These  must  also 
be  available  when  needed. 

It  is  often  true  that  the  failure  propensity  measure,  or  hazard,  or  hazard  rate,  of  a 
component,  subsystem,  or  system  eventually  increases  with  time  or  intensity  of  usage  and 
“wear”  or  “age”  after  being  fielded.  Typically,  a  new  or  modified/upgraded  system  is 
tested  so  as  to  identify  and  remove  or  reduce  the  effect  of  design  defects  or  failure  modes', 
during  the  early  testing  period  every  attempt  is  made  to  induce  the  activation,  thus 
exposure  of,  design  defects  or  failure  modes,  i.e.,  to  stimulate  occurrences  of  failures  and 
remove  their  causes;  see  [1].  Some  faults  inevitably  remain,  and  can  be  found  during 
routine  scheduled  maintenance.  However,  much  such  maintenance  can  be  unnecessary 
(time-scheduled  maintenance  is  costly  in  time  and  resources,  and  may  well  find  nothing). 
It  is  currently  thought  to  be  desirable,  and  increasingly  technically  feasible,  to  identify 
and  automatically  track  system  health  changes:  failure-preceding  or  prognostic  events 
(excess  heat,  undue  vibration,  wear  of  say  tire  treads,  tire,  engine  oil  condition,  etc.)  that 
can  signal  an  incipient  failure  and  thereby  suggest  needed  system  design  or  operational 
changes  before  total,  possibly  catastrophic,  failure  can  occur.  This  procedure  is  known  as 
Condition-Based  Maintenance  (CBM),  and  makes  use  of  currently  available  automated 
embedded  sensors;  hence  CBM+.  However,  the  CBM+  subsystem  is  itself  subject  to 
malfunction  and  failure,  so  false  positive  and  false  negative  diagnoses  must  be 
anticipated  and  made  rare  in  order  for  overall  gains  to  be  made.  Thus  the 
reliability/availability  of  CBM+  must  itself  be  considered  as  part  of  the  total 
system  package. 
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Accelerated  testing  procedures  are  often  used  at  an  early  (developmental)  testing 
stage;  see  [2],  During  the  testing  phase,  both  Developmental  and  Operational,  it  is 
intended,  likely,  and  desirable,  to  find  design  faults  and,  desirably,  “fix”  them,  so  that 
they  will  not  occur/activate  and  cause  mission- abortive  events  under  active  field 
conditions.  It  is  the  usual  objective  to  test  and  “fix”  until  times  between  failures  during 
testing  become  a  stable-in-time  random  process,  often  represented  by  a  time-between- 
successivefiailures  distribution,  with  the  further  assumption  that  the  times  between 
failures  are  statistically  independent  ( iid  random  variables=IIDRV);  see  [2],  and  several 
computer  packages,  such  as  RELIASOFT  (www.reliasoft.com). 

However,  initial  tests  of  a  newly  designed  system  typically  reveal  “infant  failures” 
that  should  be  removed  before  any  approximate  stability  of  times-between-failures 
occurs.  Such  removal  is  referred  to  as  reliability  growth,  and  it  has  been  modeled  and 
analyzed  extensively,  e.g.,  by  [3]  and  later  [4],  e.g.,  in  the  software  package 
RELIASOFT;  see  also  [1]  for  discussion  of  testing  for  growth  using  a  sequential  (run  of 
successes)  stopping  rule;  such  testing  is  made  more  complex  if  the  system  performance  is 
in  sequential  stages;  failure  of  an  early  stage  may  mask  the  exposure  of  later  stages,  thus 
requiring  longer  tests  so  as  to  explore  all  stages  for  faults.  After  {if  there  is  evidence  of 
failure  time  stochastic  stability  it  is  meaningful  to  speak  of  the  mean  {operating)  time 
between  failures',  this  is  often  specified  as  a  reliability  metric,  but  an  estimate  of  the 
probability  of  mission  success  is  usually  more  relevant,  certainly  if  the  item  is  a  one-shot, 
or  is  usually-quiescent  in  performance:  a  piece  of  ammunition  or  a  communication 
system  for  emergencies.  Caution  is  required:  justification  for  such  a  stable  failure-time 
distribution  must  depend  on  actual  data  monitoring,  and  the  currently  appropriate 
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particular  distributional  form  (model)  may  well  depend  on  environmental  conditions, 
including  maintenance  skill,  and  usage  and  environment;  e.g.,  operating  times  between 
successive  failures  for  a  particular  equipment  (helicopter)  in  the  Arctic  can  be  expected  to 
differ  from  those  operating  under  tropical,  sandy,  desert  conditions. 

Upon  failure,  some  (but  not  all,  e.g.,  expendable  missiles  and  ammunition)  system 
components  are  repaired;  this  requires  maintenance  facilities  and  personnel,  and  logistics 
(spare  parts);  the  system  may  then  be  again  capable  of  performing  designed-for  functions. 
Repair  or  replacement  is  intended  to  restore  original  capability,  or  possibly  upgrade  (or, 
in  fact,  actually  unintentionally  and  occasionally  downgrade)  system  function.  Preventive 
maintenance,  based  on  the  observed  or  inferred/sensed  physical  condition  of  the  system, 
may  be  used  to  ward  off  catastrophic  failures.  As  stated  earlier,  current  attention  is  being 
focused  on  automated  information-system-supported  condition-based  maintenance,  or 
CBM+  (alternatively,  Integrated  Vehicle  Health  Management,  IVHM).  Such  subsystems 
continuously  monitor  the  “health  condition”  of  other  subsystems  and  provide  warning  of 
imminent  malfunction/failure.  If  successful,  such  a  capability  should  allow  more 
operative  mission  hours  and  shorter  and  less  time-consuming  and  costly  repairs.  Success 
depends  on  the  trustworthiness  of  the  CBM+/IVHM  systems  available;  see  [5],  [6], 
and  [7]. 

This  chapter  considers  that  times  or  events  between  such  system  failures  (during 
functional  usage)  are  punctuated  by  periods  of  “down-time,”  during  which  the  system 
(copy  or  version  thereof)  is  (a)  awaiting  maintenance,  which  may  include  moving,  or 
being  moved,  to  a  repair  facility,  and  awaiting  the  arrival  of  parts,  instrumentation,  and 
personnel,  (b)  undergoing  diagnosis  of  faults  and  subsequent  maintenance  actions,  or 
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(c)  awaiting  further  operational  assignment  (in  idle  standby);  items  may  fail  or  degrade 
during  such  waiting/idle  times,  and  so  preliminary  tests  may  be  performed  prior  to 
mission  assignment:  pre-mission  testing  is  often  done,  especially  with  aircraft  and  other 
platforms.  This  pennits  the  detection  of  imperfect  repair  before  the  mission  actually  is 
in  progress. 

The  (classical)  System  Availability  (Ao)  parameter  is  a  measure  of  the  capability 
of  a  system  to  begin  to  perfonn  the  designed-for  function  or  mission  on  immediate 
“ random ”  demand.  Nominally,  this  would  imply  that  the  system  is  in  state  (c)  above,  but 
too  long  a  sojourn  of  idleness,  (in  “cold  standby”)  in  a  hostile  environment  can  induce 
startup  unavailability.  Mission  availability,  measuring  the  capability  of  a  system  to 
perform  its  function  from  demand  for  service  until  mission  tennination  is  usually  of  far 
more  operational  significance  than  simple  Ao.  The  latter  is  too  often  treated — 
inappropriately — as  “the  long-run  fraction  of  time”  (or  “constant  probability”)  that  a 
component  or  system  is  “up”  or  completely  mission  capable  at  the  time  of  demand  for 
operation.  In  reality,  such  a  constant  value  may  not  prevail;  “the”  value  of  Ao  may  well 
change  with  time  and  usage,  including  maintenance  variation  between  individual  copies 
of  a  system  design.  Often  no  account  is  taken  of  system  age  or  condition,  or  the  variable 
mission  environment  that  may  affect  it.  Further,  a  system’s  availability  may  be  partial: 
e.g.,  a  multi-engine  vehicle  can  still  function  if  degraded  by  the  loss  of  one  out  of  several 
engines.  However,  loss  of  an  engine  or,  one  of  several  sensors,  or  a  communication  link 
sometimes  allows  a  modified  or  limited  version  of  the  intended  mission  to  proceed  until  a 
replacement  is  furnished.  Consequently,  availability  need  not  be  a  binary 
instantaneous  concept. 
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The  object  of  this  chapter  is  to  describe  methods  in  use  for  describing,  measuring, 
testing,  and  improving  Availability  and  to  provide  cautionary  comments  similar  to 
those  above. 

2,  COMPONENT  AND  SYSTEM  OPERATIONAL  CLASSIFICATION 

Engineered  systems  of  all  types  are  composed  of  numbers  of  components 
(“parts”),  subsystems,  and  entire  systems  designed  to  perform  cooperative  functions. 
Modern  warfare  has  been  said  to  be  conducted  by  “systems  of  systems.”  Such  systems 
usually  operate  efficiently  and  effectively  according  to  correct  design,  manufacture  and 
usage.  They  inevitably  fail  or  degrade  with  age  and  stress;  some  failure  modes  are 
mission  and  even  life-threatening,  and  can  occur  with  little  warning.  Hence, 
condition-based  monitoring  (CBM)  and  preventive  maintenance  (PM)  are  required  and 
invoked;  time  spent  in  such  makes  the  entity  temporarily  mission-unavailable,  but  can 
provide  greater  overall  net  system  mission  availability. 

Systems,  and  especially  subsystems  and  components,  can  be  roughly 
categorized  as 

•  One-Shot/Disposable  (e.g.,  fuel,  ammunition,  missiles  or  mines,  electronic 
components  such  as  computer  chips  or  screens,  remote  vulnerable  sensors, 
etc.);  and/or 

•  Repairable  (Segmentally)  -  Replaceable  (e.g.,  failure-prone  engines  and 
control  devices,  vehicle  chassis,  computer  hardware,  elements  of 
communication  networks,  etc.).  An  injured  or  wounded  human  can 
sometimes  be  included  in  this  category,  as  being  part  of  an  entire  system. 
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In  many  cases,  a  failed  repairable  system  becomes  less  susceptible  to  repair  if  too 
much  time  elapses,  so  the  particular  item's  availability  tenninates,  but  its  function  is 
performed  by  a  replacement  or  substitute;  the  latter  can  be  a  redesigned  upgrade;  see  [1]. 
Further,  the  pattern  of  component/system  usage  can  vary:  many  items,  such  as  some 
sensors  and  alarm  systems  and  engines  and  generators,  are  normally  running  or  “hot” 
during  a  mission,  so  if  failure  occurs  or  becomes  imminent,  symptoms  are  evident.  Others 
are  normally  inactive  or  “cold,”  such  as  an  idle  aircraft  parked  on  a  flight  deck,  or  a 
rescue  vehicle. 

Many  systems  are  made  up  of  complex  assemblies  or  combinations  of 
One-Shot/Disposable  and  Repairable/Replaceable  subsystems:  a  vehicle  or  platform 
bums  fuel  and  may  fire  ammunition,  both  disposable,  but  its  chassis,  propulsion,  and 
steering  and  sensors  and  communications  are  very  often  repairable/replaceable.  Many 
vehicles  operate  in  groups:  convoys  of  trucks  or  small  boats,  task  groups,  aircraft 
squadrons,  etc.;  here  the  mission  may  involve  all  elements  of  the  group,  so  the  timely 
availability  of  such  a  subforce  at  a  particular  site  can  be  spoiled  if  just  one  of  the  member 
elements  fails:  the  entire  system  may  become  mission-incapable  or  unavailable  if  one  or 
more  of  such  elements  fails  or  is  damaged  by  opponent  action  (when  in  military  or 
homeland  protection  application),  or  by  owner/user  mishandling.  The  tendency  for  such 
to  occur  is  enhanced  when  hostilities  occur,  and  when  the  sustainment/support  forces  and 
logistics  are  overwhelmed  and  themselves  unavailable. 

3.  MODELING  AVAILABILITY 

System  availability  has  been  characterized  probabilistically  or  stochastically  for 
many  years;  an  initial  classic  is  [8],  but  aspects  of  repairable  system  availability  go  far 
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back  to  Erlang,  Khinchine,  and  to  Palm  in  the  1930s  and  a  further  few  decades,  well 
summarized  in  [9]  as  the  “repairman  problem”  or  service-system.  See  also  [10]  and  [11] 
on  survival  analysis,  essentially  summarizing  biological  (e.g.,  animal,  human 
“reliability”);  failure-prone,  but  repairable,  systems  resemble  biological  epidemics: 
failures  of  subsystems  can  stress,  or  infect,  other  subsystems  that,  in  turn,  can  cause  total 
system  collapse  if  not  quickly  isolated  and  mitigated. 

The  simplest  analytical  model  for  a  failure  prone,  but  repairable,  system  is  the 
alternating  renewal  process  (see,  however,  practical  cautions  in  Section  1).  Letting  a 
sequence  of  up-times,  {£/]•},  and  a  sequence  of  down-times,  {  Z)  } ,  be  independent 

sequences  of  independent  random  variables  with  marginal  distributions  Fv{u\6)  and 

G[)  ,  respectively,  and  8  and  cp  representing  parameters;  then,  if  A(t)  is  the 

availability  at  time  t ,  given  that  the  subsystem  starts  at  the  beginning  of  an  up  period, 
U] ,  say,  we  have  the  backward  recurrence  renewal  equation 

t 

A(t)  =  P{Ux>t)  +  \p{Ux+Dl&dx\A(t-x),  (1) 

o 

which  can  be  solved  numerically  or  in  terms  of  Laplace-Stieltjes  transforms.  Basic 
renewal  theory  asymptotics,  [12],  Chapter  11,  shows  that  if  E\U~\  and  P\P]  exist/are 
finite,  then 

hmzl(t)  =  £[*7]/(£[*7]  +  £[i)])  =  Ao  (2) 

t— >00  '  ' 

the  popular  (often  misused)  operational  availability.  In  cases  in  which  the  system  must 
function  throughout  a  mission  of  length  M  a  more  infonnative  measure  is 
mission  availability, 
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Am  =  Ao 


P{U>M } 


an  asymptotic  renewal  theory  result. 

Complex  systems,  starting  with  Series/Tandem  configurations,  and  extending  to 
Series/Parallel/Redundant  combinations  can  be  mathematically  analyzed  under  the 
initially  stated  independence  assumptions,  and  with  these  progressively  relaxed. 

4.  STATISTICAL  INFERENCE  ON  AVAILABILITY 


If  the  simple  assumption  of  mutually  independent  sequences  of  independent 
identically  distributed  (iid)  up-time  r.v.s,  {{//}  and  likewise  iid  down  times,  { Di }  is 
(provisionally)  acceptable  then  alternating  renewal  theory  shows  that  the  long-run  point 
availability  is  given  byAo  =  £[t/] /(£[£/]  +  £  [D]).  provided  expectations  exist.  A 

natural  estimate  of  Ao,  Ao  =  u  /  (it  +  d ) ,  where  u  and  d  are  the  sample  averages  of  up 

and  down  realizations;  assume  no  censoring;  [13]  assesses  confidence  intervals  for  Ao 
using  jackknifing,  wherein  the  Logit  transform  of  Ao  is  recomputed,  omitting  pairs  of 
observations  successively.  The  method  is  applied  with  good  success  to  various  redundant 
systems  and  several  plausible  distributional  forms.  An  alternative  would  be  the  bootstrap 
(see  [14]).  A  sensible  (semi)  nonparametric  approach  is  the  nonparametric  bootstrap: 
resample  from  the  empirical  distributions  of  up-and-down  times,  provided  that 
preliminary  examination  of  the  data  does  not  wildly  contradict  iid  assumptions;  see  [15]. 
In  many  applied  situations,  data  may  be  insufficient  to  validate  (or  invalidate)  such  an 
assumption  at  all  conclusively.  It  may  well  be  wise  to  adaptively  smooth  up  and 
down-time  means  and  employ  these  in  the  Ao  formula.  Details  are  available  from 
the  references. 
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5. 


CONCLUSION 


Many  more,  and  more  complex,  results  are  available  from  the  references.  For  an 
excellent  general  overview  see  [16]. 
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